Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote: locally index list of checksums available on cloud remotes #3634

Merged
merged 32 commits into from
Apr 21, 2020

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Apr 15, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Index changes from #3600.
Related to #2147.

Docs PR: iterative/dvc.org#1163

New behavior

  • Checksums available on the remote are indexed (cached) locally in a sqlite3 database. The db file will be stored as .dvc/tmp/index/<filename>.idx where filename is the SHA256 digest of the remote URL.
    • Only .dir checksums, and checksums for files contained inside those directories will be indexed
    • Standalone files will not be indexed
  • On any dvc command which queries remote status (status -c/push/pull/fetch) index will be updated to include any new .dir's which are found on the remote (or are pushed successfully to the remote)

Implementation details

  • remote.status/remote.cache_exists
    1. Validate our index by querying for .dir checksums on the remote. If a .dir checksum is in our index, but not the remote (i.e. someone else has run gc -c), clear the entire index.
    2. If a .dir checksum exists on the remote, trust that directory file contents also exists on the remote (unchanged from current behavior)
    3. Update index with any new directories found on the remote (i.e. dirs that someone else has push'd since we last queried remote status)
    4. Query remote for standalone files (unchanged from current behavior)
  • pull/fetch: If any index related mismatch/error occurs (i.e. trying to download a file that is in our index but not on the remote), clear the entire index.
  • push: After push operation, successfully uploaded .dir checksums and file contents checksums for the directory contents are added to the index. Files uploaded during a partially failed push will not be indexed
  • gc -c: Re-index will be required after a gc -c operation
  • Changes to the index database are only committed once, after all operations on the remote have finished.

Todo

  • use .dir checksum existence on remote to validate/invalidate index
  • unit tests for index module
  • functional tests for new behavior
  • docs
  • benchmarks
  • changes from 4/17 discussion

@pmrowla pmrowla added the performance improvement over resource / time consuming tasks label Apr 15, 2020
@pmrowla pmrowla self-assigned this Apr 15, 2020
Comment on lines +63 to +78
INDEX_TABLE = "remote_index"
INDEX_TABLE_LAYOUT = "checksum TEXT PRIMARY KEY, " "dir INTEGER NOT NULL"
Copy link
Contributor Author

@pmrowla pmrowla Apr 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are using a database, would it be better to have a single .dvc/tmp/index db, with appropriate tables/columns for handling multiple remotes, vs having multiple single-table .dvc/tmp/index/... dbs?

I'm not sure how common it is for users to have multiple remotes configured, but in the case where you have multiple large remotes, as it is right now we would be duplicating data across multiple (potentially large) index db files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmrowla If we are talking about storing unpacked *.dir cache files, then there doesn't seem to be a point to store them for each remote, since no matter the remote if it has *.dir present we trust that it also has all cache files from within it. So maybe it is a premature optimization for later general indexes for whole remotes.

dvc/remote/local.py Outdated Show resolved Hide resolved
@pmrowla pmrowla marked this pull request as ready for review April 17, 2020 06:50
@pmrowla pmrowla changed the title [WIP] remote: locally index list of checksums available on cloud remotes remote: locally index list of checksums available on cloud remotes Apr 17, 2020
@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 17, 2020

Some benchmarks from my machine (dual-core i7-5557U @ 3.10GHz, osx catalina):

S3 remote, 2M imagenet files, w/existing state database, but no pulled/fetched files

latest PR branch, with no existing index file (equivalent to first time running status -c or forced re-index via --clear-index)

time dvc status -c
517.33s user 103.87s system 81% cpu 12:46.00 total

latest PR branch, with existing index file

time dvc status -c
473.61s user 80.07s system 90% cpu 10:12.18 total

latest master

time dvc status -c
478.89s user 83.17s system 86% cpu 10:49.67 total

0.91.0 default (pre-optimizations)

todo

0.91.0 no-traverse = False

time dvc status -c
802.44s user 87.22s system 42% cpu 35:04.45 total

rclone (w/sanitized remote url)

time rclone sync --dry-run remote:remote/ .dvc/cache/
141.96s user 21.53s system 64% cpu 4:11.82 total

The traverse improvements from 0.91.0 to now should be more drastic on machines with more cores.
Also, it would be ideal if we had a convenient way to only benchmark the cloud/remote steps to see how we compare to tools like rclone

@efiop
Copy link
Contributor

efiop commented Apr 17, 2020

@pmrowla And what about the case when we add 1(or just n << N) new file to the dataset? I think it should also shine there, compared to what we had before index.

Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some unnecessary fixture includes

tests/func/remote/test_index.py Outdated Show resolved Hide resolved
tests/func/remote/test_index.py Outdated Show resolved Hide resolved
tests/func/remote/test_index.py Outdated Show resolved Hide resolved
@efiop efiop changed the title remote: locally index list of checksums available on cloud remotes [WIP] remote: locally index list of checksums available on cloud remotes Apr 17, 2020
@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 17, 2020

Had discussion with @efiop, a couple of changes needed before this can be merged.

  1. Keep behavior from current master regarding trusting .dir files on the remote
  2. For push, only update index when entire dir has been uploaded successfully

Will also check cprofile results and update benchmarks

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 20, 2020

Previous benchmarks post has been updated with current master and PR branch numbers for status -c. Whenever there is no pre-existing index db (or when --clear-index is used to force reindexing), adding the initial 2M checksums to the db does add ~2 minutes of overhead, but after the index exists, the overhead for checking the existing index is negligible.

Running with cProfile, pathinfo handling code is a large bottleneck for us when checking cloud status, particularly in dvc/stage/base.py::_collect_used_dir_cache()

from python -m cProfile -o status.prof -m dvc status -c with an existing index db:

Screen Shot 2020-04-20 at 3 07 59 PM

Screen Shot 2020-04-20 at 3 18 02 PM

Screen Shot 2020-04-20 at 3 16 06 PM

zip w/cprofile output:
cprofile.zip

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 20, 2020

Benchmarks for status -c after adding 1 file to an existing pushed (and indexed) dir with 100k files.

S3 remote, 100k total files in remote cache

latest PR branch

time dvc status -c
13.58s user 5.91s system 82% cpu 23.671 total

latest master

time dvc status -c
41.61s user 8.32s system 84% cpu 59.136 total

0.91.0 default

time dvc status -c
423.96s user 26.00s system 37% cpu 19:53.69 total

0.91.0 no_traverse = False

time dvc status -c
33.46s user 7.08s system 36% cpu 1:49.62 total

rclone

time rclone sync --dry-run .dvc/cache s3://...
103.33s user 67.31s system 7% cpu 36:59.68 total

dvc/remote/base.py Outdated Show resolved Hide resolved
@pmrowla pmrowla changed the title [WIP] remote: locally index list of checksums available on cloud remotes remote: locally index list of checksums available on cloud remotes Apr 20, 2020
@shcheklein
Copy link
Member

@pmrowla amazing stuff πŸ‘ - ~7 times faster than rclone.

I'm curious where do we spend 13seconds though in this scenario πŸ€”

@shcheklein
Copy link
Member

@pmrowla is it pathinfo handling code is a large bottleneck for us when checking cloud status, particularly in dvc/stage/base.py::_collect_used_dir_cache()?

@shcheklein
Copy link
Member

Also, out of curiosity - do we actually need --drop-index option? A bit worried that we provide it to every data management related command.

dvc/command/data_sync.py Outdated Show resolved Hide resolved
@efiop
Copy link
Contributor

efiop commented Apr 20, 2020

Also, out of curiosity - do we actually need --drop-index option? A bit worried that we provide it to every data management related command.

Thinking about it, looks like current --clear-index simply drops the index for specific remote. So maybe it would be more logical do create dvc remote clear-index(maybe there is a better name) instead. There doesn't seem to be a good use case from the user point of view, but it is really helpful for us when debugging or providing support. That being said, we could always ask user to rm -rf .dvc/tmp/index instead. So looks like it is indeed not strictly necessary but handy.

Looks like we could indeed drop it for now until someone asks. I think it was me who asked for this flag awhile ago, but it no longer looks like a great idea to me, I think i was thinking about something like --ignore-index originally, but I don't see a good use case for that neither.

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 21, 2020

--clear-index has been removed and the associated docs PR is closed

dvc/remote/index.py Outdated Show resolved Hide resolved
Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

πŸ”₯

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance improvement over resource / time consuming tasks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants